<!DOCTYPE html>
<html lang="en">

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Load/Store Hoisting</title>
<link rel="stylesheet" type="text/css" href="https://gcc.gnu.org/gcc.css" />
</head>

<body>
<h1>Load/Store Hoisting</h1>

<p>August 19, 1998</p>

<p>Mark Mitchell Consulting
as part of a continuing contract with the Accelerated Strategic
Computing Initiative (ASCI) program at Los Alamos National Laboratory,
has contributed code to hoist loads from and stores to memory out of
loops.  Because many programs spend most of their time inside loops,
making loops go faster can result in dramatic improvements in the
execution time of the programs.</p>

<p>To see the utility of this new optimization, consider the following
function:</p>
<pre>
  void f(int* k, int j)
  {
    int i;

    for (i = 0; i &lt; j; ++i)
      *k = *k + 2 * ((*k) - 1);
  }
</pre>

<p>Without the new optimization, the code generated for the loop on an
x86 looked like:</p>
<pre>
.L5:
	movl (%ecx),%eax           # Load *k
	leal -2(%eax,%eax,2),%eax  # Compute *k + 2 * ((*k) - 1)
	movl %eax,(%ecx)           # Store *k
	incl %edx                  # Increment i
	cmpl %ebx,%edx             # Test i &lt; j
	jl .L5                     # Loop if i &lt; j
</pre>
<p>With the optimization, the code generated for the loop becomes:</p>
<pre>
	movl (%ebx),%eax           # Load *k
.L5:
	leal -2(%eax,%eax,2),%eax  # Compute *k + 2 * ((*k) - 1)
	incl %edx                  # Increment i
	cmpl %ecx,%edx             # Test i &lt; j
	jl .L5                     # Loop if i &lt; j
	movl %eax,(%ebx)           # Store *k
</pre>

<p>Note that in the second case the loads and stores occur outside the
loop.  In this particular case, we can expect the loop to run about
33% faster, since there are 4 instructions instead of 6.  (Of course,
the vagaries of today's deeply pipelined architectures can play havoc
with some estimates, so instruction count is only a rough guess at
execution time.)</p>

<p>Another example, of interest to the Fortran community, is code like:</p>
<pre>
   subroutine gemm(a, b, c, m, n, k)
   integer i,m,n,k,l,j
   dimension a(k,m),  b(n,k),  c(n,m)
   do i=1,m
     do j=1,n
       do l=1,k
	 c(j,i) = c(j,i) + a(l,i)*b(j,l)
       end do
     end do
   end do
   end
</pre>

<p>Here, the code generated for the inner loop will not load or store
<code>c(j,i)</code> inside the loop.  Instead, the initial value will
be loaded into a register, and the sum will be accumulated there.
After the loop is complete, the value will be stored back to memory.</p>

</body>
</html>
